Code Question: Handling UTF-8 encoding

We have a Java application running on a WebLogic server that picks up XML messages from a JMS or MQ queue and writes them to another JMS queue. The application doesn't modify the XML content in any way. We use BEA's XMLObject to read messages from and write messages to the queues.

The XML messages declare their encoding as UTF-8.

We have an issue when the XML contains characters outside the normal ASCII range (the £ symbol, for example). When the message is read from the queue we can see that the £ symbol is intact. However, once we write it to the destination queue, the £ symbol is lost and is replaced with Â£ instead.

I have checked the OS level settings (locale settings) and everything seems to be fine. What else should I be checking to make sure that this doesn't happen?

From stackoverflow

Without a few more specifics, I'd guess that there is a method that optionally takes an encoding somewhere that isn't specified and is defaulting to ISO-8859-1. Commonly, check anything that passes between an InputStream/OutputStream and a Reader/Writer.

For instance, an OutputStreamWriter takes an optional encoding that you could be leaving out.
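
As a rough illustration of that point, the sketch below passes the charset to OutputStreamWriter explicitly instead of relying on the platform default. The class and method names (ExplicitEncodingExample, writeUtf8) are made up for this example and are not part of the original application.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ExplicitEncodingExample {

    // Encodes text to bytes through a Writer whose charset is stated explicitly,
    // so the result does not depend on the OS locale or the file.encoding setting.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = writeUtf8("\u00A3"); // U+00A3 is the pound sign
        // Print the raw bytes in hex so the console encoding cannot interfere.
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

The same idea applies to InputStreamWriter's counterpart InputStreamReader, to String.getBytes, and to the String(byte[], Charset) constructor: each has an overload that takes the charset, and each falls back to a default when it is left out.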

"once we write it to the destination queue, the £ symbol is lost and is replaced with Â£ instead"

That tells me the character is being written as UTF-8, but it's being read as if it were in a single-byte encoding like ISO-8859-1. (For any character in the range U+00A0..U+00BF, UTF-8 produces the byte 0xC2 followed by a byte equal to the character's code point, so if you encode it as UTF-8 and decode it as ISO-8859-1, you end up with the two-character sequence ÂX, where X is the original character.) I would look at the encoding settings of the receiving JMS queue.
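
The symptom is easy to reproduce in isolation. The following is a minimal sketch, assuming nothing about the actual application: it encodes a string as UTF-8 and then deliberately decodes the bytes as ISO-8859-1. The class and method names (MojibakeExample, misdecode) are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class MojibakeExample {

    // Encodes the text as UTF-8, then wrongly decodes those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u00A3"); // U+00A3 is the pound sign
        // Print code points rather than glyphs so the console encoding cannot interfere.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println(); // prints: U+00C2 U+00A3
    }
}
```

U+00C2 is Â, so the one-character input comes back as the two characters Â£, which matches what the question describes.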

Mani : Yes. It was an issue with the encoding setting, not at the JMS queue, but at the OS level (which I thought was correct and mentioned so in my original query).

Alan Moore : I'm glad you figured it out, and I hope you're taking the advice offered in the other replies: if you really have to do the byte/character conversions yourself, you should always specify the encoding instead of relying on the OS settings.

You should use InputStream, OutputStream, and byte[] to handle XML documents, not Reader, Writer, and String. In the world of JMS, BytesMessage is a better fit for XML payloads than TextMessage.

Every XML document specifies its character encoding internally, and all XML processing APIs are oriented to take byte streams and where necessary figure out the correct character encoding to use themselves. The text-based APIs are only there… to confuse people, I guess! Anyway, applications should let the XML processor deal with character encoding issues, rather than trying to manage it themselves (or using a text-oriented API without a solid understanding of character-encoding issues).
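
A small sketch of that approach, using only the JDK's built-in DOM parser: the same document is serialized once as UTF-8 and once as ISO-8859-1, and the parser is handed the raw bytes each time, leaving it to read the encoding from the XML declaration. The class name, method name, and the <price> element are made up for the example. JMS itself is left out because it needs a provider on the classpath; with a BytesMessage the byte[] would come from the message body instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, for illustration only.
class XmlBytesExample {

    // Parses XML from raw bytes; the parser picks the charset from the
    // document's own encoding declaration, not from any platform setting.
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String utf8Doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><price>\u00A3</price>";
        String latin1Doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><price>\u00A3</price>";

        // Each document is turned into bytes using the charset it declares.
        String fromUtf8 = rootText(utf8Doc.getBytes(StandardCharsets.UTF_8));
        String fromLatin1 = rootText(latin1Doc.getBytes(StandardCharsets.ISO_8859_1));

        // Both byte streams differ, yet both should decode to the same text.
        System.out.println(fromUtf8.equals(fromLatin1));
    }
}
```

The application code never names a charset when reading; it only has to make sure the bytes it passes along are the bytes it received.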

Thursday, February 17, 2011
