Tuesday, March 1, 2011

é is not correctly parsed

My application reads XML from a URLConnection. The XML encoding is ISO-8859-1, and the content contains an é character. I use the Xerces SAXParser to parse the received XML. However, é is not parsed correctly when the application runs on Linux; everything works fine on Windows. Could you guys please give me some hints? Thanks a lot.

From stackoverflow
  • I bet this is related to file.encoding. Try running with -Dfile.encoding=iso-8859-1 as a VM parameter on linux.

    If this works, you probably need to specify the correct format when opening the stream (somewhere in your code).
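As a rough sketch of what "specify the correct format when opening the stream" can look like, the snippet below decodes the same byte with and without the right charset. The class and method names (`ExplicitCharsetDemo`, `readAll`) are made up for illustration, and the single byte stands in for whatever the real connection returns.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Reads the whole stream as text, decoding with the given charset
    // instead of relying on the platform default (file.encoding).
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // e-acute encoded as ISO-8859-1
        System.out.println("Platform default: " + Charset.defaultCharset());

        String ok = readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        String bad = readAll(new ByteArrayInputStream(latin1), StandardCharsets.UTF_8);

        System.out.println(ok.equals("\u00e9"));  // true: decoded as e-acute
        System.out.println(bad.equals("\uFFFD")); // true: lone 0xE9 is malformed UTF-8
    }
}
```

This only applies when you really need text. For XML handed to a parser, passing the raw bytes is usually the safer route.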

  • This is probably a case of a file marked as "ISO-8859-1" when it in reality is in another encoding.

    Often this happens with "ISO-8859-1" and "Windows-1252": They are being used as if they were interchangeable, but they are not. (In the comments to this answer it has been clarified that both encodings agree on a character code for "é", so Windows-1252 is probably not it.)

    You can use a hex editor to find out the exact char code of the "é" in your file. You can take that value as a hint to what encoding the file is in. If you have control over how the file is produced, a look at the responsible code/method is also advisable.

    Jon Skeet : I agree with the statements about them often being confused, and them actually being different - but the e-acute is in ISO-8859-1 at U+00E9, so I suspect it's not the problem in this particular case.
    Tomalak : Then maybe the file has been saved in *yet another* encoding.
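For comparison with what a hex editor shows, here is a small sketch that prints the bytes "é" takes in two common encodings. The helper name `hex` is hypothetical; only standard JDK charsets are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HexDumpDemo {
    // Returns the bytes of text in the given charset as space-separated hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";
        System.out.println("ISO-8859-1: " + hex(eAcute, StandardCharsets.ISO_8859_1)); // e9
        System.out.println("UTF-8:      " + hex(eAcute, StandardCharsets.UTF_8));      // c3 a9
    }
}
```

A single `e9` in the file is consistent with ISO-8859-1 (or Windows-1252), while the pair `c3 a9` suggests the file is actually UTF-8 despite its header.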
  • The first thing you should do is determine the real encoding of the XML file, as Tomalak suggests, not the encoding stated in the header.

    You can start by opening it with Internet Explorer. If the encoding is not correct you may see an error like this:

    An invalid character was found in text content. Error processing resource ...

    Or the following one:

    Switch from current encoding to specified encoding not supported. Error processing resource ...

    Using a text editor that supports several encodings is the next step. You can use Notepad++, which is free, easy to use and supports several encodings. No matter what the XML header says about the encoding, the editor tries to detect the encoding of the file and displays it on the status bar.

    If you determine that the file encoding is correct, then you may not be handling the encoding correctly inside Java. Take into account that Java strings are UTF-16 and that, when converting from/to byte arrays with no encoding specified, Java defaults to the system encoding (Windows-1252 under Windows or UTF-8 on modern Linuxes). Some encoding conversions only cause "strange" characters to appear, such as conversions between fixed 8-bit encodings (i.e. Windows-1252 <-> ISO-8859-1). Other conversions raise encoding exceptions because of invalid characters (try importing Windows-1252 text as UTF-8 for example).

    An example of invalid code is the following:

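    // Parse the input
    SAXParser saxParser = factory.newSAXParser();
    // Problem: getBytes() without an argument encodes with the platform default charset,
    // which may not match the encoding named in the XML declaration
    InputStream is = new ByteArrayInputStream(stringToParse.getBytes());
    saxParser.parse( is, handler );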
    

    By default, stringToParse.getBytes() returns the string encoded with the platform encoding (Windows-1252 on Windows platforms). If the XML text was encoded in ISO-8859-1, at this step you get wrong characters. The correct approach is to read the XML as bytes, not as a String, and let SAX manage the XML encoding.
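A minimal sketch of the "hand the bytes to SAX" approach, using the JDK's built-in parser; `ParseBytesDemo` and `parseText` are illustrative names, and the in-memory byte arrays stand in for what the URLConnection would deliver. The second call reproduces the kind of mismatch described above: bytes produced in UTF-8 (the usual Linux default) under a declaration that says ISO-8859-1.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class ParseBytesDemo {
    // Parses raw XML bytes and returns all character data found.
    // The parser picks the decoding from the XML declaration.
    static String parseText(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xmlBytes), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>\u00e9</name>";

        // Bytes match the declaration: the parser yields e-acute.
        byte[] matching = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(matching).equals("\u00e9")); // true

        // Bytes re-encoded as UTF-8 but still declared ISO-8859-1:
        // the two UTF-8 bytes come out as two separate characters.
        byte[] mismatched = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(parseText(mismatched).equals("\u00c3\u00a9")); // true
    }
}
```

In a real application the stream from the connection can be passed to `parse` directly, with no intermediate String at all.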

  • If the XML declaration doesn't specify an encoding, the SAX parser will try to use the default encoding, UTF-8.

    If you know the character encoding but it isn't specified in the XML declaration, you can tell the parser to use that encoding with an InputSource:

    InputSource inputSource = new InputSource(xmlInputStream);
    inputSource.setEncoding("ISO-8859-1");
    
    erickson : To be more precise: it *must* use UTF-8 if the encoding is not specified in the XML declaration.
    Sophie Tatham : Thanks - I thought so but wasn't certain.
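To see the InputSource approach end to end, here is a hedged, self-contained sketch. It assumes the JDK's bundled SAX parser honours the encoding set on the InputSource when the declaration names none; the class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class InputSourceEncodingDemo {
    // Parses XML bytes whose declaration names no encoding,
    // telling the parser which encoding to use via InputSource.
    static String parseText(byte[] xmlBytes, String encoding) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            source.setEncoding(encoding);
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // No encoding in the declaration, but the bytes are ISO-8859-1.
        String xml = "<?xml version=\"1.0\"?><name>\u00e9</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(bytes, "ISO-8859-1").equals("\u00e9"));
    }
}
```

Without the `setEncoding` call the same bytes would be read as UTF-8, where a lone 0xE9 is not a valid sequence, so the parse would be expected to fail.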
  • Sorry for my late reply. We solved the problem. We were doing a wrong operation on the input stream (just as Fernando Miguélez said, a conversion caused the problem).

    Thanks to all of you for your help.
