Solved

Multipart Encoding Problem

Posted on 2004-08-04
18
500 Views
Last Modified: 2010-08-05
To all,
   I had recently posted and to my chagrin, closed as solved a question on the exact same topic, but now realize that I didn't fully test the solution. I also didn't have as many clues to the problem as I have now, so here it is:

   1) I'm using Javamail to send out a multipart email, half text/plain and half text/HTML.
   2) The HTML I'm sending is mostly coming a SQL Server database, from a "text" datatype field, which is a binary field. It appears to be coming out in UTF-8 formatting.
   3) The HTML contains special (upper ASCII) characters such as the TradeMark and Registered symbols. It is written to the database from a ColdFusion (yeah, I said it! ha ha) content management system called Ektron. I believe Ektron, though partially the bane of my existence is not the problem.
   4) When I look at other multipart emails, I see that the HTML part is has the Content-Transfer-Encoding: set to 7bit, but mine are coming out as quoted-printable. I believe this is the root of the problem. I believe I found out where I can fix this, but I'm not sure. I think I can set this directly at the Part/Multipart (can't remember the name right now) object level. It says setHeader, which I think is for the Part header, not the message header, which is where I've tried to set it before.
   5) I use sendmail as the mail server. I thought this was the problem for a while, but now I think that was an incorrect assumption.

   - How should I retrieve the data from SQL Server, with getBytes and create a new String with "UTF-8" formatting?
   - Is the Content-Transfer-Encoding another piece of the problem?
   - If so, am I correct in solving it at the Part level?

   I have been on this one for a while and very desperately need it solved.

   Thanks You  






0
Comment
Question by:rosettatg
  • 8
  • 7
  • 2
  • +1
18 Comments
 
LVL 35

Expert Comment

by:girionis
ID: 11723906
When you read the HTML from the db, by default, without setting any encoding, do you see the characters correctly?
0
 
LVL 1

Expert Comment

by:keesjanvoogd
ID: 11724506


hi,


The multipart delimiters and header fields are always 7-bit ASCII. Data within the body parts can be encoded on a part-by-part basis, with Content-Transfer-Encoding fields for each appropriate body part.

The 7-bit is for old email gateways that don't handle more than 7-bits.

Also make sure the data in de db is correct. G in G out as they say. And yes the utf-8 switch is a good idea.


KJ
0
 
LVL 9

Expert Comment

by:Venci75
ID: 11726880
>> - How should I retrieve the data from SQL Server, with getBytes and create a new String with "UTF-8" formatting?
Try to create a string using new String(byte[], "utf-8"); and check whether all string characters are correct. If it is not - then probably the encoding is not UTF-8

>> - Is the Content-Transfer-Encoding another piece of the problem?
I don't bevieve

>> - If so, am I correct in solving it at the Part level?
Try to set the the content type header:
setHeader("Content-Type", "text/html; charset=UTF-8");
0
 

Author Comment

by:rosettatg
ID: 11728444
To all,
   I have a few updates that should address most of the comments thus far and address one or two of my own ommisions.
    - I am now setting the Content-Transfer-Encoding to 7bit in the text/HTML BodyPart. As a few of you had predicted, this did (or didn't) do squat, depending on how much you value squat.
    - I am using String str = new String(res.getBytes("content"), "UTF-8") to get the data into a String. I've also tried using getASCIIStream as well. The other options left are getClob, and then use it's methods or getCharacterStream which returns a reader. I am now trying just getBytes to look directly at the bytes.      
    - To address the G in G out theory, which is always a possibility,the exact data from the same fields and datbase that I retrieve is successfully retrieved and sent from a ColdFusion based system. I can see no special formatting happening in the process.  
   
   This is exactly what I'm seeing so far:
    When I get the data from the database, and I parse through the String before it ever hits an email or System.out, the special characters seem to have already been translated (or encoded). I use the char hashcode to get the hex version of teh ASCII codes. The Extended ASCII characters such as ™ or ’ have codes that end up being way off from what they should be. When I replace odd hex values with the proper one, it shows ok. This solution feels ugly but I'd do it if I could just to solve this problem. The secondary issue is that although this works fine when the "bad" ASCII codes are from 0 - 31 (not used in print), but some of the special characters end up with real ASCII codes, like ™ shows up as ASCII 034 which is the normal ". I can't replace all of the double quote marks with the ™ symbol.

   I am thinking that Ektron must store the data in some way that is OK with Coldfusion, but Java or the MS JDBC driver doesn't like. I'm going to extract the bytes and parse them directly, before ever putting them into a String, and see what that shows.

   Any better ideas are more than welcome.
   Thank You

     
   
0
 
LVL 35

Expert Comment

by:girionis
ID: 11728532
> - To address the G in G out theory, which is always a possibility,the exact data from the same fields and datbase that I
>retrieve is successfully retrieved and sent from a ColdFusion based system. I can see no special formatting happening in
>the process.  

>  When I get the data from the database, and I parse through the String before it ever hits an email or System.out, the
> special characters seem to have already been translated (or encoded).

So from cf to db to cf everything works fine but from cf to db to java there are issues since the characters are already encoded into their corresponding ASCII references. It might be a driver issue. Can you try to access the database from a non-java non-cf standalone client and see if it works? Also it would help us if you tell us the SQL Server version and the driver you are using.
0
 

Author Comment

by:rosettatg
ID: 11728675
SQL Server 2000
MS JDBC Type 4 Driver

I can use MS Query Analyzer and select it into a file. I've selected into text instead of the default grid, and it shows as blocks, but in fairness that may be ok as MS Query Analyzer must use some kind of encoding to present the data back in the first place.

As am also thinking of looking into how Ektron saves teh data in the first place, I can't mess with it, but it might shed some light.
0
 

Author Comment

by:rosettatg
ID: 11728682
The Java part is running on Linux.
0
 
LVL 35

Accepted Solution

by:
girionis earned 400 total points
ID: 11729062
The only thing we really need to do is to find out what encoding Ektron is using to save the characters. We also need to make sure if it is or not a driver problem. I remember I had a few problems with the TDS driver once and when I replaced it with the default WebLogic drivers the problems were gone. So the driver is a possible problem also.

Also are you using any clients for remotely accessing MS SQL? Do you see the characters properly? Can we find out the encoding Ektron or CF is using to save the characters?
0
 

Author Comment

by:rosettatg
ID: 11730367
I'll be looking into the encoding that Ektron uses either tonight or Saturday, depending on how ugly it gets.  
Regarding the MS JDBC driver, I'm going to try the jtds driver.
I've used Query Analyzer, and the data looks the same, but I'm going to double check this information.  

   Thanks
0
Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

 
LVL 35

Expert Comment

by:girionis
ID: 11733215
Ok let us know the encoding, once you know it you can easily reconstruct it by passing in the String constructor the encoding type.
0
 
LVL 1

Assisted Solution

by:keesjanvoogd
keesjanvoogd earned 100 total points
ID: 11734540


If you are going to use jtds there is a charset (extra)property in the connection string that is worth not ignoring. This needs to be set to the same value as the charset the db is running.

In the case of m$ the charset that the server was installed with ???

I think it's sort of the same for tds (property file somewhere).


KJ
0
 

Author Comment

by:rosettatg
ID: 11759324
Update:
   I found that about 8 out of the 200 plus NewsLetters work fine with my programming as I first had it about a month ago. That is to say with a simple getString method with no explicit conversion. When these emails went out in the past, they have all of the symbols that the others have, which is to point out that they are in no way more basic than the NewsLetters that don't work for me.
   This leads me to focus on what was different in the creation process of the few "good" NewsLetters from the creation of the many "bad" NewsLetters. My gut is saying that somewhere in the cutting, pasting and merging that goes into the creation process, the symbols are getting some Word formatting or something similar caught up in the encoding. I'm going to look at that end of the process to make the "good" process a repeatable one.
0
 

Author Comment

by:rosettatg
ID: 11826972
Just to keep any interested parties up to date, I have yet to solve this, but I've moved down a new path or two. After installing the Ektron ActiveX control needed to add in the data the way the client does, I found that I could not yet duplicate the few successful newsletters that are already in the database. I also noticed that if I view the "bad" newsletters in MS SQL Query Analyzer, they look fine, with all of the special characters showing up properly. This is leading me down the alternate JDBC driver path, thinking that maybe Microsoft's driver is the problem.

So at this point I'm continuing experimenting entering the data trying to find the formula of the 8 or so "good" entries, an I'm pursuing the JDBC driver angle.  
0
 
LVL 35

Expert Comment

by:girionis
ID: 11827915
Can you find out the encoding of these "good" 8 letters?
0
 

Author Comment

by:rosettatg
ID: 11875362
I did a good bit of work on this problem this past weekend, better identified the heart of the problem and the boundaries of where my solution needed to fit, and I believe I solved* it as well. I use an "*" next to the claim of solved, because although it looks like it works, I just can't believe that what I did was the only solution.

The Core Problem has a few stable facts, which I cannot for various reasons, change:
    1) The datatype is Text. It's possible that were it ntext, which holds unicode data at twice the size, it would be easier for me to deal with in Java.
    2) The only characters that give me a problem are in the Windows Latin 1 range, from 129 to 15?. Over and under are fine, so © and ®, but ™ is not. This means that ISO-8859-1 works, but Windows Latin 1 does not.
    2) Ektron writes the special characters already encoded, meaning that when I use query analyzer to view the data, I see "©". It has the ability to either write them as #&153; or &copy, but it's not quite working out that way. It's easy to blame the product, but when I hunted around a little in the ColdFusion code and documentation, it can write it as I need it, and the documentation talks directly to what I wanted to do, which was to save the data in it's numeric format. It looked like the configuration file was set up to do so, but something was keeping it from working like that most of the time. If you remember I've used the terms "good" and "bad" referring to the newsletters, well the few good ones used the #&153; format, and the bad one saved the characters.
       It's too dangerous and messy to play with the Ektron end of this production level system, so I had to let that end of the problem be a fixed constant.

THE SOLUTION: Roll my own encoding for the Windows Latin characters that were getting mis-encoded. I did this by creating a pure byte array of the data, pushing the characters into a StringBuffer, and using the byte array as an index into the SringBuffer to let me know which characters needed to be changed. The bytes for the special characters came out as negative numbers, in a pattern, so it looks like this will work, but I can't help thinking there would be a better way.

   I realize that the clean solution would be to enter the data into the database in a way that would facilitate clean extraction, but I don't have control over that aspect of the system.

   Right now I'm using the JTDS driver, with charset=ISO-8859-1, getting the bytes and StringBuffer, and manipulating the characters as described above. If anyone knows of an encoding option that would work, considering the formatting of the data in the database, I'm all ears. It is not UTF-8, US-ASCII, UTF-16 or CP1252.
   
   Thanks for the advice so far. I will say at the least, I like the JTDS driver much better than the Microsoft one.  
   
   
0
 
LVL 35

Expert Comment

by:girionis
ID: 11878870
> The bytes for the special characters came out as negative numbers, in a pattern, so it looks
> like this will work, but I can't help thinking there would be a better way.

If it works stick with it :)

>  I realize that the clean solution would be to enter the data into the database in a way that
> would facilitate clean extraction, but I don't have control over that aspect of the system.

Then I guess this is not a solution.

>  If anyone knows of an encoding option that would work, considering the formatting of the data
> in the database, I'm all ears.

The only encoding that it woudl work would be the same that the data is written to the database. But you will have to have control on the part that writes to the database.

0
 

Author Comment

by:rosettatg
ID: 12011146
I'm closing this one out. I'll get to the final solution in a moment, but before that I'll express a minor gripe about Experts-Exchange. I would like to be able to only give out about 200 points, partially fo being a sounding board and partially for turning me on to a better JDBC driver, but I can't. If I split points, I have to give them all out, or nothing at all, and neither option feels right. Taking the lesser of two evils, I'm giving them all out.

The solution in the end was to extract the data as bytes and convert them into strings like "™", which works best for HTML. I needed to do this only for the characters in the Windows Latin character subset of ISO-8859-1. Simple, ugly and working.

Thanks to anyone that helped along the way.  
0
 
LVL 35

Expert Comment

by:girionis
ID: 12013988
Glad you got it working :)

If you want to only assign 200 points please ask a question to community support (http://www.experts-exchange.com/Community_Support/) for zero points to reopen this question for you and reduce the points.
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

Java contains several comparison operators (e.g., <, <=, >, >=, ==, !=) that allow you to compare primitive values. However, these operators cannot be used to compare the contents of objects. Interface Comparable is used to allow objects of a cl…
By the end of 1980s, object oriented programming using languages like C++, Simula69 and ObjectPascal gained momentum. It looked like programmers finally found the perfect language. C++ successfully combined the object oriented principles of Simula w…
Viewers learn about the “while” loop and how to utilize it correctly in Java. Additionally, viewers begin exploring how to include conditional statements within a while loop and avoid an endless loop. Define While Loop: Basic Example: Explanatio…
This video teaches viewers about errors in exception handling.

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now