Link to home
Start Free TrialLog in
Avatar of joibrooks
joibrooksFlag for United States of America

asked on

Remove HTML Code by Query from a Field in Microsoft Access

i manage monthly uploads to a number of different shopping channels. google base is one of these channels. the data that i have comes preformatted by my client. for google base, the feed can be in text delimited format. i manage this data in microsoft access 2003. recently, my feeds are being disapproved because they have changed their requirements. a required field is a description of the product but cannot be the same as the product title. the description field that i have in the preformatted file from my client includes HTML code, with many different tags. i need to remove all these tags before upload. obviously each tag begins with "<" and ends with ">", but there are many variations within the tag.

my question is, how do i write this query (queries)? query (queries) would be run monthly, as the client sends me a new database for upload every month.

i greatly appreciate your expertise and look forward to your helpful resolution.

the database holds >20,000 records.
Avatar of Gustav Brock
Gustav Brock
Flag of Denmark image

You can easily write a clean-up function with the free ActiveX String Component from ChilkatSoft:

http://www.example-code.com/asp/asp_html_entity_decode.asp 

<quote>
The Chilkat ASP String Component is freeware and may be used in any ASP
application. To install, download and unzip the ASP String Component.
Then copy the CkString.dll to a directory on your ASP Web Server and
register it with regsvr32.exe.
The CkString.dll is a standard ActiveX component. It may also be used in
Visual Basic 6.0, Delphi, Cold Fusion, Visual FoxPro, PowerBuilder, MFC,
or any other programming language that supports ActiveX components.
</quote>

http://www.example-code.com/asp/aspstring.asp 

/gustav
Example code here:
 
<snip>
set cks = Server.CreateObject("CkString.CkString")
 
' Strip HTML from a string.
cks.Str = "<html><body><p>This is a test<br>Here is a <a
href=""http://www.chilkatsoft.com"">test link</a></p></body></html>"
cks.StripHtml
' Prints "This is a testHere is a test link"
Response.Write cks.Str + "<br>"
 
' Perhaps it's better to replace <br> tags with SPACE chars first...
cks.Str = "<html><body><p>This is a test<br>Here is a <a
href=""http://www.chilkatsoft.com"">test link</a></p></body></html>"
cks.ReplaceAll "<br>"," "
cks.ReplaceAll "<BR>"," "
cks.StripHtml
' Prints "This is a test Here is a test link"
Response.Write cks.Str
</snip>

Open in new window

Avatar of joibrooks

ASKER

i don't have access to an asp server.

i was trying to develop an update query that would find all instances of:

like "<*>"

and replace it with

""
without removing the rest of the character data in the string. i'm looking through all my resource books and the internet to find the correct syntax.

thanks for your response. i wish i could utilize the resolution.
> without removing the rest of the character data in the string

If so, just use Replace:

strClean = Replace(Replace(strHtml, "<", ""), ">", "")

/gustav
getting warmer. something like building a query with an expression like:

Replace([LongDesc],"<*","",">","")

but this isn't working. i'm getting a data type mismatch in query expression.
Try with:

Replace(Replace([LongDesc], "<", ""), ">", "")

/gustav
darn! LongDesc is set to Memo datatype and not Text. the field often holds more than 255 characters. what next?
Try with:

Replace(Replace(CStr([LongDesc]), "<", ""), ">", "")

/gustav
I think I see some confusion here.
gustav is replace two specific characters, and not the entire tag. While there may be a way to invoke regular expressions in a query with some sort of UDF, I do not know it. If replace("<*", "") worked, you word be deleting everything after the first tag.
Did you try the link I posted?
getting "invalid use of null" error message. btw...thanks for hanging in there...
yes jerry. i did look at that and tried building the function but got a compile error.

user-defined type not defined

referring to Dim varRegex As RegExp

??
Did you add the reference to regular expressions I also mentioned?
Try with:

Replace(Replace(CStr(Nz([LongDesc], "")), "<", ""), ">", "")

/gustav
Replace(Replace(CStr(Nz([LongDesc], "")), "<", ""), ">", "")

WORKS!!!!!!!

can you explain the syntax. i'd like to understand the Nz function.

:c)
As far as I can tell, this only removes < and > from field, and not the tags.  If that is what you wanted, fine, but it is not what you asked for in original question. :)
Nz() replaces a Null value with something else, here a zero length string.

Replace must have a string to read and CStr() takes care of that, though it should not be needed, as memos normally are read as text. You could try with:

Replace(Replace(Nz([LongDesc], ""), "<", ""), ">", "")

/gustav
jerry, i really appreciate your effort to resolve my problem correctly.

i did test the function in Access through a select query in design mode and it produced an exportable result. perhaps i didn't word my original request properly. the replace function that gustav provided removes  opening"<" tags and closing ">" tags and everything between them. this function leaves me with straight text, which is my goal.

-jb
That is, of course, the purpose of allowing comments after a question is closed.  And for allowing a question to be re-opened for scoring.
In this case, /gustav provided me such a wealth of knowledge, I do not begrudge him the points.
jerryb30 - you were correct!!!!!. what i did not notice was that the replace function was filtering my dataset. how may i reopen this item?

i am now attempting your VB solution and am trying to understand what you meant by:

"add the reference to regular expressions I also mentioned?"

thank you!
Sorry, our internet has been down.
In your module, select Tools, References.
Scroll down the list until you see the one for regular expressions:
 Microsoft VBScript Regular Expressions 5.5
and check the box next to it.
okay, so the script works as jerryb30 detailed in his last discussion note. but now...

1. some of the records in this new field (i'm calling this nothtml field description) display "#Error"
2. the text in most of these fields have carriage returns (as identified by a box character)
3. the export tab delim file errors out because of number 1 and 2
4. there are fields that are null, but they were null before we ran the select query

(I am thinking this is another question.) Without the actual DB, I cannot think of how to diagnose this issue. If it is a matter of how the text is formatted before it is imported, maybe it can be fixed before this is done.
Do the null fields give #Error##? Is this in the db, or in the exported data?
the db comes to me from my client and is littered with inconsistent keystrokes. in this field that i'm trying to standardize there may be null or " " data, text data or html coded data. there is no standard procedure for data input. the datatype is MEMO.

the "#Error" is displayed in the newly created 'nohtml' field - which i've named [description] - generated by our SELECT query. "#Error" is not in every field that appears to be blank. i am thinking that this error message is caused by empty fields or fields with " " data.

as i mentioned, this data is going to Google's shopping channel, called Google Base (aka Froogle). the datafeed is processed but fails to meet their standards. if i push the data as it is with html in the [description] field, the upload is flat-out terminated. with the carriage returns generated by the stripped html data, the text delim file has crazy line breaks. while the upload completes... the file is rejected because it appears to have too many fields. this [description] field is required and Google is apparently running a validity check on the field's data - the field cannot contain the same data as displayed in the record's [title] field, it cannot contain canned marketing phrases, it cannot contain the same data as other records.

i have been pushing this same data as provided by my client on a monthly basis for over five years to various shopping channels. each channel has different requirements. as the channels change their requirements, i further manipulate the data. this current requirement from Google has me stumped.

a little further history of this db: the client uses this data as-is to populate their primary e-commerce system.

i'm upping the point value to 500. this isn't as simple as we all first thought it would be.
> i did not notice was that the replace function was filtering my dataset.

Well, when you remove something from the data, you are filtering it whatever method you use.

/gustav

Again, sorry for delay in response. Without trying to recreate your situation, inconsistent inputs are going to be a problem.  While HTML has substitutes for CR/LF, which works well when reading with a browser, it can play havoc with your display.  The inability to export a text-type memo field with a CR seems a little odd.  Maybe playing with the NZ function will take away for first #ERROR# problems, an allow the export.
Otherwise, you might have to post a sample DB with a few records which represent a range of inconsistent inputs at www-ee-stuff.com


i've uploaded a +3000 record zipped db with a variety of entries that demonstrate the issues i'm having manipulating the data. thank you!
Please post your exact SQL which gives the errors.
I will look at this when I can, but cannot promise immediate response.
jerryb30 - i can't thank you enough for your patience. i appreciate all that you can do for me.

here is the SQL statement:

SELECT tabCatalog.ID AS id, tabCatalog.Price AS price, "http://www.companyname.com/default.asp?CatalogID=" & [catalogID] & "&SubfolderID=76&AssocID=froogle" AS link, IIf(tabcatalog.imagefilename="" Or tabcatalog.imagefilename Is Null,"http://www.companyname.com/merchant/logo.gif","http://www.companyname.com/merchant/" & tabcatalog.imagefilename) AS image_link, "2007-12-30" AS expiration_date, IIf([desc4]="" Or [desc4] Is Null,"Vitamins & Supplements",Trim([desc4])) AS product_type, tabCatalog.Desc6 AS brand, "US:Ground:6.95" AS shipping, tabCatalog.Desc9 AS upc, [desc1] & " - " & [ID] AS title, "New" AS condition, tabCatalog.ID AS mpn, REP(tabCatalog.LongDesc) AS description INTO froogletable
FROM tabCatalog
WHERE (((tabCatalog.ID) Not Like "mpi-*"));
ASKER CERTIFIED SOLUTION
Avatar of jerryb30
jerryb30
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
jerryb30 has well-earned his rank as GENIUS. many many thanks.