Java Encoding Issues

January 4, 2010 - 8:17pm
Submitted by gary

    Anyone who has worked with multi-language applications knows how frustrating it is to deal with special characters and encodings.

   Even now, I can never get it quite right with those applications; the special characters always manage to mutate into some funky character at some point during the application's existence. Below is a short list of fixes to help with multi-language development.

   Here are my recent encounters and the solutions I found.

 

1. UTF-8 is the way to go:

    Java's default encoding depends on your machine; on a Windows box with a Western European locale it is usually Cp1252. That should suffice for most development work.

    But if you really need special characters, UTF-8 is so far the best encoding we have available for multi-language support. So make sure to use this encoding for all file types before proceeding with any development.
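    Before changing anything, it can help to see which default your own machine actually uses. The following is a minimal sketch, not part of the original fixes; the class name DefaultEncodingCheck and the method describeDefaults are made up for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, named here only for illustration.
class DefaultEncodingCheck {

    // Reports the file.encoding system property and the charset the JVM
    // uses when no encoding is given explicitly.
    static String describeDefaults() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", default charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
    }
}
```

    The output differs from machine to machine, which is exactly the problem: any code that relies on the default behaves differently depending on where it runs.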

    In Eclipse, to change file encodings, go to Preferences => General => Content Types.

        For each content type listed there, make sure the encoding is UTF-8.

        Preferences => XML should be set to UTF-8 as well.

        For those who build web applications, make sure that under Preferences => Web all file types (CSS, HTML, etc.) are also set to UTF-8. (Better safe than sorry.)

 

2. When parsing files and other source streams, KNOW their encoding.

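   To convert from one encoding to another, name the encoding explicitly both when encoding and when decoding:

    String unicodeString = "test test";
    // Encode the string as UTF-8 bytes. Both calls declare UnsupportedEncodingException.
    byte[] utf8Bytes = unicodeString.getBytes("UTF-8");
    // Decode the bytes back into a String, naming the same encoding.
    String utf8String = new String(utf8Bytes, "UTF-8");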
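
   A real conversion usually starts from bytes in one encoding and ends with bytes in another. Here is a minimal sketch of that round trip, assuming Java 7 or later for StandardCharsets; the class name TranscodeSketch and the method transcode are my own names for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
class TranscodeSketch {

    // Decodes the input with the encoding it was written in, then
    // re-encodes the resulting text in the target encoding.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] cp1252Bytes = { (byte) 0xFC }; // u-umlaut as a single Cp1252 byte
        byte[] utf8Bytes = transcode(cp1252Bytes, cp1252, StandardCharsets.UTF_8);
        // The same character takes two bytes in UTF-8: 0xC3 0xBC.
        for (byte b : utf8Bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

   The key point is that the String in the middle carries no encoding of its own; the encoding only matters at the two edges where bytes come in or go out.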

3. When putting in special characters, use Unicode escapes such as \u03b2 and \u00fc rather than the literal characters β and ü when the text file encoding does not support these characters.

        Example:

             String s = "special characters: \u03b2 and \u00fc";

             This will result in: "special characters: β and ü"

        Be cautious when using these Unicode escapes in tests. My test suite just dies if a test uses a Unicode character that the current encoding does not support.
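
        One way to check what an escaped string really contains, without depending on the console or file encoding, is to print the code points instead of the characters. This is only a sketch; the class name UnicodeEscapeSketch and both method names are made up for illustration.

```java
// Hypothetical helper class, named here only for illustration.
class UnicodeEscapeSketch {

    // Written with escapes only, so this source file stays pure ASCII.
    static String sample() {
        return "special characters: \u03b2 and \u00fc";
    }

    // Lists every non-ASCII character as U+XXXX, which prints safely
    // whatever encoding the console uses.
    static String describeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 127) {
                out.append(String.format("U+%04X ", (int) c));
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(describeNonAscii(sample())); // prints "U+03B2 U+00FC"
    }
}
```

        Comparing code points like this in a test avoids printing or writing the special characters at all, so the check itself cannot be garbled by the platform encoding.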

 

Note on servlet containers:

    For JBoss, make sure to add -Dfile.encoding=UTF8 to the JAVA_OPTS environment variable.

    For Tomcat, make sure to add -Dfile.encoding=UTF8 to the CATALINA_OPTS environment variable.

 
    Also, the safest bet when reading from a stream is to specify the stream encoding:

        Example: // Files from a Windows box are usually in Cp1252 encoding, so we declare our reader as follows:

                       FileInputStream stream = new FileInputStream("inputfile");

                       InputStreamReader reader = new InputStreamReader(stream, "Cp1252");
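
    To round this out, here is a minimal sketch of reading a whole stream with an explicit encoding, assuming Java 8 or later (for try-with-resources and UncheckedIOException). The class name ExplicitEncodingReadSketch, the method readAll and the temporary file are all made up for illustration.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper class, named here only for illustration.
class ExplicitEncodingReadSketch {

    // Reads the whole stream as text using the given encoding and closes it.
    static String readAll(InputStream stream, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small Cp1252 sample file so the sketch runs on its own.
        File file = File.createTempFile("cp1252-sample", ".txt");
        file.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write(new byte[] { 'f', (byte) 0xFC, 'r' }); // three Cp1252 bytes
        }
        String text = readAll(new FileInputStream(file), Charset.forName("windows-1252"));
        System.out.println(text.length()); // 3 characters, one per byte
    }
}
```

    Passing a Charset object rather than a name string also removes the need to handle UnsupportedEncodingException at every call.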