Why is this Java encoding UTF-8 --> Latin1 wrong?












I want to download this UTF-8 file and convert it to Latin1 in Java (Android). At line 443, Frango-d’água-menor is converted to Frango-d?água-menor instead of Frango-d'água-menor. The same happens at line 465, where Descrição física… becomes Descrição física?, with that pesky ? at the end.

Is this file perhaps not valid UTF-8? Yet iconv -f utf-8 -t iso-8859-1//TRANSLIT works just fine on it.

This is the code I use for the conversion (the downloaded file is in infofile):



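// Read the downloaded file as UTF-8 and rewrite it as Latin1 (ISO-8859-1).
fos = new FileOutputStream(infotxt);
out = new OutputStreamWriter(fos, "Latin1");
fis = new FileInputStream(infofile);
// Name the input charset explicitly instead of relying on the platform default.
br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
while ((line = br.readLine()) != null) {
    // Characters that Latin1 cannot represent are written as '?'.
    out.write("\n" + line.trim());
}
br.close();
out.close();
fis.close();
fos.close();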









  • A ? usually means that either the program doing the writing does not know how to convert a certain character, or the program you use for display does not know what to print. You will probably have to do some special-case handling or try another converter. iconv probably has a better mapping table between the two encodings.

    – leonardkraemer
    Jan 2 at 16:56













  • The ? appears in the (correctly written Latin1) downloaded file.

    – Luis A. Florit
    Jan 2 at 17:00






  • Then the first part of my comment is the answer. OutputStreamWriter has no mapping for that specific character from UTF-8 to Latin1. See stackoverflow.com/questions/652161/…

    – leonardkraemer
    Jan 2 at 17:04











  • Exactly. But then why did TRANSLIT in iconv do a perfect job, and how can I simulate that in Java? Maybe something like this: stackoverflow.com/a/5807419/1483390

    – Luis A. Florit
    Jan 2 at 17:16













  • I guess you will have to fumble with CharsetEncoder or some other solution. You could run iconv on Android via the NDK, but that would give you even more problems.

    – leonardkraemer
    Jan 2 at 17:23
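
One possible reading of the CharsetEncoder suggestion above, as a minimal sketch: it assumes the goal is to fail loudly instead of silently writing ?. The names StrictLatin1, strictEncoder and encodeStrict are hypothetical; the java.nio.charset calls are standard. It also assumes Java 7+ (or Android API 19+) for StandardCharsets and try-with-resources.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictLatin1 {
    /** Builds a Latin-1 encoder that reports unmappable characters instead of substituting '?'. */
    static CharsetEncoder strictEncoder() {
        return StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    /** Encodes text to Latin-1 bytes; throws CharacterCodingException (an IOException) on failure. */
    static byte[] encodeStrict(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // OutputStreamWriter accepts a preconfigured encoder in place of a charset name.
        try (Writer out = new OutputStreamWriter(bytes, strictEncoder())) {
            out.write(text);
        }
        return bytes.toByteArray();
    }
}
```

This only detects the problem. It does not transliterate the way iconv's //TRANSLIT does, so a replacement step is still needed before writing.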


















java android encoding






asked Jan 2 at 16:51 by Luis A. Florit















1 Answer
The file you linked is a UTF-8 encoded HTML file, and it uses characters outside the Latin-1 character set. For example, instead of the apostrophe that you expect (Frango-d'água-menor, using code U+0027), it uses the similar-looking Right Single Quotation Mark U+2019 (Frango-d’água-menor). That character isn't part of the Latin-1 set, so you get a replacement question mark.
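
The substitution can be reproduced in isolation. The following is a small sketch, not the asker's code; Latin1Probe and its method names are made up for illustration, and StandardCharsets assumes Java 7+ or Android API 19+.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1Probe {
    private static final Charset LATIN1 = StandardCharsets.ISO_8859_1;

    /** True if the character can be represented in ISO-8859-1. */
    static boolean fitsLatin1(char c) {
        return LATIN1.newEncoder().canEncode(c);
    }

    /** Encodes to Latin-1 and decodes again; String.getBytes replaces unmappable characters. */
    static String roundTrip(String text) {
        return new String(text.getBytes(LATIN1), LATIN1);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1('\''));      // apostrophe, U+0027
        System.out.println(fitsLatin1('\u2019'));  // right single quotation mark, U+2019
        System.out.println(roundTrip("Frango-d\u2019\u00e1gua-menor"));
    }
}
```

The round trip turns U+2019 into ? while leaving á untouched, which matches the output described in the question.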



Latin-1 covers only the first 256 Unicode code points (U+0000 to U+00FF), so some loss like this is unavoidable.



Your best chance is to identify the problem characters and do a string replacement before writing to the limited Latin-1 set.
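
To identify the problem characters, something along these lines may help. It is a sketch under the assumption that the whole text (or each line) is available as a String; ProblemCharFinder and unmappable are hypothetical names.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;

final class ProblemCharFinder {
    /** Returns the distinct code points in text that ISO-8859-1 cannot encode, as "U+XXXX". */
    static Set<String> unmappable(String text) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        Set<String> found = new TreeSet<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Test one code point at a time so surrogate pairs are handled as a unit.
            String single = new String(Character.toChars(cp));
            if (!latin1.canEncode(single)) {
                found.add(String.format("U+%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return found;
    }
}
```

Running this over the downloaded file once gives the list of characters that need a replacement rule.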






  • Do you know how to make a "reasonably" safe replacement, like iconv's TRANSLIT does, on the line variable in my code?

    – Luis A. Florit
    Jan 2 at 18:36













  • I mean, line.replaceAll("[\u2018\u2019]", "'") works for the quotation mark. Is there perhaps some 'safe', relatively small list like this? I don't need to implement everything here: git.savannah.gnu.org/cgit/libiconv.git/tree/lib/translit.def

    – Luis A. Florit
    Jan 2 at 19:13








  • Only you can decide which replacements you need (which special Unicode characters you expect in the input). You have a good (?) reference source of replacements. If you need just a handful, go for multiple line.replaceAll(). If you need more, I'd build a replacement array indexed by the character code, containing the replacement string, maybe initialized from translit.def or similar.

    – Ralf Kleberhoff
    Jan 2 at 19:27
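
The table-based idea in the comment above could look roughly like this. It is a sketch, not the method the asker ended up with: it uses a map instead of an array, Latin1Translit and toLatin1Safe are hypothetical names, and the entries are an illustrative subset chosen for this example, not a copy of translit.def.

```java
import java.util.HashMap;
import java.util.Map;

final class Latin1Translit {
    // Illustrative subset only; extend it with whatever characters the input contains.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u2018', "'");    // left single quotation mark
        TABLE.put('\u2019', "'");    // right single quotation mark
        TABLE.put('\u201C', "\"");   // left double quotation mark
        TABLE.put('\u201D', "\"");   // right double quotation mark
        TABLE.put('\u2013', "-");    // en dash
        TABLE.put('\u2026', "...");  // horizontal ellipsis
    }

    /** Replaces known non-Latin-1 chars from the table; every other char above U+00FF becomes '?'. */
    static String toLatin1Safe(String line) {
        StringBuilder sb = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c <= '\u00FF') {
                sb.append(c);  // already representable in Latin-1
            } else {
                String repl = TABLE.get(c);
                sb.append(repl != null ? repl : "?");
            }
        }
        return sb.toString();
    }
}
```

In the loop from the question, this would be applied just before writing, for example out.write("\n" + Latin1Translit.toLatin1Safe(line.trim())).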











  • Yes, that is what I did, multiple line.replaceAll() after looking at the translit.def. I guess only a handful may appear. Thanks!!

    – Luis A. Florit
    Jan 2 at 19:30












answered Jan 2 at 18:00 by Ralf Kleberhoff












