Encoding issues when reading files and comparing strings

April 23, 2013 - 4:21pm
Submitted by admin

When working with special characters, we sometimes find that a string does not seem to contain the characters we expect.
This happens especially when a file written on one platform is parsed on another.

I had the following line in a text file on Windows (saved as ANSI):

éèêë

I had the following code that parses and replaces these characters:
(Java file is saved as UTF-8)
-------------------
// Requires: import java.io.*;
File file = new File("/windowsPathMappedOnLinux");
FileInputStream fis = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fis);
DataInputStream dis = new DataInputStream(bis);
// No charset is passed here, so the platform default encoding is used.
BufferedReader input = new BufferedReader(
        new InputStreamReader(dis));
String fileContent = input.readLine();
// Closing the outermost reader also closes dis, bis and fis.
input.close();
fileContent = fileContent.replace("é", "e");
fileContent = fileContent.replace("è", "e");
fileContent = fileContent.replace("ê", "e");
fileContent = fileContent.replace("ë", "e");
---------------------------------------
BUT THIS DIDN'T WORK!!

Problems:
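1. The text file is saved as ANSI, which on a Western European Windows installation means the Windows-1252 code page (Cp1252 in Java; closely related to, but not identical with, ISO-8859-1).
On Linux my default encoding is UTF-8, and because the code names no charset, the ANSI bytes are decoded as UTF-8.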

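2. The Java file is saved as UTF-8, while the file we are parsing is ANSI. If the editor or compiler reads the Java source with any encoding other than the one it was saved in, the literals passed to replace() are no longer the characters we intended.
Changing the file encoding of the Java file to Cp1252 results in: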
---------------------
fileContent = fileContent.replace("Ã©", "e");
fileContent = fileContent.replace("Ã¨", "e");
fileContent = fileContent.replace("Ãª", "e");
fileContent = fileContent.replace("Ã«", "e");
---------------------
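
Both problems can be reproduced without any file. The following is only a sketch: the class name MojibakeDemo and the helper reinterpret() are made up for this illustration. It encodes the four characters with one charset and decodes the resulting bytes with the other. The characters are written as Unicode escapes so the example does not depend on how the Java file itself is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // "windows-1252" is the canonical name; "Cp1252" is the older Java alias.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical helper: encode text with one charset, decode the bytes with another.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String accents = "\u00e9\u00e8\u00ea\u00eb"; // the four accented e characters

        // Problem 1: ANSI (Cp1252) bytes decoded as UTF-8.
        // The bytes are not valid UTF-8, so the decoder substitutes U+FFFD.
        String readAsUtf8 = reinterpret(accents, CP1252, StandardCharsets.UTF_8);
        System.out.println(readAsUtf8.contains("\u00e9")); // prints false
        System.out.println(readAsUtf8.contains("\uFFFD")); // prints true

        // Problem 2: UTF-8 bytes decoded as Cp1252.
        // Each character becomes two: U+00C3 followed by a second symbol.
        String readAsCp1252 = reinterpret("\u00e9", StandardCharsets.UTF_8, CP1252);
        System.out.println(readAsCp1252.equals("\u00c3\u00a9")); // prints true
    }
}
```

In the first case replace() never finds an accented character to replace, and in the second case the literal in the source is no longer the single character we meant to type.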

Solution:

1. We need to explicitly specify the encoding when parsing the file, by changing the reader to:

BufferedReader input = new BufferedReader(
new InputStreamReader(dis, "Cp1252"));

2. We need to ensure that the RIGHT characters are in our Java file.
We can do this by switching the Java file to the intended encoding in the editor, pasting the correct characters, and saving it. The compiler must then read the file with that same encoding (for example, javac -encoding UTF-8).
The literals in the Java file must display as the correct characters when the file is opened in the encoding the compiler uses.
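
Putting both points together, a minimal sketch could look like the following. The class name AnsiFileReader and its two helper methods are made up for this example, and it uses Files.newBufferedReader rather than the stream chain above, which is a swapped-in alternative and not the original code. The literals are written as Unicode escapes, one way to keep the right characters in the source regardless of how the Java file is saved.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class AnsiFileReader {
    // "windows-1252" is the canonical name; "Cp1252" is the older Java alias.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Read the first line of the file, decoding its bytes with the given charset.
    static String readFirstLine(Path path, Charset charset) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            return reader.readLine();
        }
    }

    // Replace the four accented variants of 'e' with a plain 'e'.
    // Unicode escapes make the literals independent of the source file encoding.
    static String stripAccentedE(String text) {
        return text.replace("\u00e9", "e")  // e with acute
                   .replace("\u00e8", "e")  // e with grave
                   .replace("\u00ea", "e")  // e with circumflex
                   .replace("\u00eb", "e"); // e with diaeresis
    }

    public static void main(String[] args) throws IOException {
        // The path is the placeholder from the post; pass a real one as an argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "/windowsPathMappedOnLinux");
        String line = readFirstLine(path, CP1252);
        System.out.println(stripAccentedE(line));
    }
}
```

The try-with-resources block closes the reader even if readLine() throws, so no separate close() calls are needed.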